MARKET BASKET ANALYSIS
Alparslan Erol
09/02/2021
The goal of this project is to analyze the market baskets of consumers. By analyzing baskets, insights into consumer purchasing behavior can be obtained. To achieve this goal, a dataset was downloaded from Kaggle. The dataset contains 38,765 rows of purchase orders of people from grocery stores. These orders can be analyzed and association rules can be generated using Market Basket Analysis with algorithms such as the Apriori algorithm. Time series analysis could also be conducted, but it is not the focus of this research.
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import plotly_express as px
from apyori import apriori
import networkx as nx
from fa2 import ForceAtlas2
import random
import itertools
df = pd.read_csv("Groceries_dataset.csv")
df.head()
Wrangling Dataframe
Casting the attributes to the appropriate types and printing them;
df["Member_number"] = df["Member_number"].apply(str)
df["itemDescription"] = df["itemDescription"].apply(str)
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
df.sort_values(by=["Member_number", "Date"], inplace=True)
df.reset_index(drop=True, inplace=True)
df["value"] = 1
df.head()
Histogram
From the histogram below, we can identify the most and least frequently purchased items. Since the number of items is large, the plot is stacked. Because it is a Plotly plot, you can hover over the bars to examine the number of occurrences for each item. As we can see, whole milk, other vegetables, and rolls/buns are the three most frequently purchased items.
fig_hist = px.histogram(df, "itemDescription", color_discrete_sequence=px.colors.diverging.Spectral,\
title="Histogram for Market Basket Analysis", labels={"itemDescription":"Items in Basket"})
fig_hist.update_layout(yaxis_title_text="Number of Occurrences (Count)")
fig_hist.update_xaxes(tickangle=45, categoryorder="total descending")
fig_hist.show()
Preparing pivot table for association rule analysis
df_pivot = df.pivot_table(index=["Member_number", "Date"], columns=["itemDescription"], values=["value"], fill_value=0)
column_names = []
for i, j in df_pivot.columns:
    column_names.append(j)
df_pivot.columns = column_names
df_pivot.head()
Creating a separate dataframe with the number of occurrences of each item.
df_count = pd.DataFrame(df.groupby(by=["itemDescription"])["value"].sum().reset_index())\
.rename(columns={"value":"count"}).sort_values(by=["count"], ascending=[False]).reset_index(drop=True)
df_count.head()
Word Cloud
A word cloud is a visual representation of word frequency. The more frequently a term appears within the text being analysed, the larger it appears in the generated image. Word clouds are a simple tool for identifying the focus of written material. df_count holds the item frequencies for the word cloud. As you can see below, whole milk, other vegetables, rolls/buns, and soda are again printed larger than frankfurter, curd, etc.
wc = WordCloud(max_words=3000, background_color="white")
freq_dict = df_count.set_index('itemDescription').T.to_dict('records')
cloud = wc.generate_from_frequencies(freq_dict[0])
plt.figure(figsize=(15, 25))
plt.imshow(cloud, interpolation='bilinear')
#cloud.to_file('word_cloud.png')
plt.title("Word Cloud for Market Basket Analysis", fontsize=18)
plt.axis("off")
plt.show()
Association Rule Learning
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1] In the domain of market basket research, the main goal is to find relations between the purchased items. For example, let's assume the following rule;
$${\displaystyle \{\mathrm {loaf, yogurt} \}\Rightarrow \{\mathrm {milk} \}}$$
The rule above indicates that consumers who purchase loaf and yogurt together will most likely purchase milk as well. This kind of information can be used in sales strategies such as promotion, pricing, and bundling.
Concepts for Understanding[2]
Support:
Support is an indication of how frequently the itemset appears in the dataset.
The support of X with respect to T is defined as the proportion of transactions t in the dataset which contains the itemset X.
$${\displaystyle \mathrm {supp} (X)={\frac {|\{t\in T;X\subseteq t\}|}{|T|}}}$$Suppose we have 4 transactions in total and exactly one of them, {loaf, yogurt, milk}, contains both loaf and yogurt. Then the support of {loaf, yogurt} is 1/4 = 25%, and the support of {loaf, yogurt, milk} is also 25%.
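The support formula can be checked directly in Python. The transactions below are toy examples of my own (not from the dataset), chosen so that only one basket contains both loaf and yogurt:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

# Four toy transactions; only the first contains both loaf and yogurt.
transactions = [
    ["loaf", "yogurt", "milk"],
    ["soda"],
    ["curd"],
    ["rolls/buns"],
]

print(support(["loaf", "yogurt"], transactions))          # 0.25
print(support(["loaf", "yogurt", "milk"], transactions))  # 0.25
```

Note that support uses subset containment: a transaction counts for {loaf, yogurt} even if it also contains other items.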
Confidence:
Confidence is an indication of how often the rule has been found to be true.
The confidence value of a rule, X -> Y , with respect to a set of transactions T, is the proportion of the transactions that contains X which also contains Y.
Confidence is defined as:
$${\mathrm{conf}}(X\Rightarrow Y)={\mathrm {supp}}(X\cup Y)/{\mathrm {supp}}(X)$$Using the example from the support section, where both supports are 25%;
$${\displaystyle \mathrm{conf}(\{\mathrm {loaf, yogurt} \}\Rightarrow \{\mathrm {milk} \}) = 0.25/0.25 = 1.0 }$$
because {loaf, yogurt, milk} is the only transaction that contains both loaf and yogurt.
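The confidence formula follows directly from the support computation. A self-contained sketch on the same toy transactions as before (my own examples, not from the dataset):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(x, y, transactions):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

transactions = [
    ["loaf", "yogurt", "milk"],
    ["soda"],
    ["curd"],
    ["rolls/buns"],
]

# Every transaction containing {loaf, yogurt} also contains milk.
print(confidence(["loaf", "yogurt"], ["milk"], transactions))  # 1.0
```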
Preparing a separate dataframe for association rule learning.
analysis_df = df.groupby(['Member_number','Date'])['itemDescription'].apply(','.join).reset_index()
analysis_df["itemDescription"] = analysis_df["itemDescription"].apply(lambda row: row.split(","))
analysis_list = list(analysis_df.itemDescription)
print(analysis_list[0:5])
analysis_df.head()
Defining the run_apriori_algorithm method for running the Apriori algorithm.
def run_apriori_algorithm(list_for_items, min_sup, min_conf):
    rules = apriori(list_for_items, min_support=min_sup, min_confidence=min_conf)
    frules = []
    for r in rules:
        for o in r.ordered_statistics:
            conf = o.confidence
            supp = r.support
            x = list(o.items_base)
            y = list(o.items_add)
            #print("{%s} -> {%s} (supp: %.3f, conf: %.3f)" % (x, y, supp, conf))
            frules.append((x, y, supp, conf))
    cols = ["{X} ->", "{Y}", "Support (>%s)" % min_sup, "Confidence (>%s)" % min_conf]
    result_df = pd.DataFrame(frules, columns=cols).sort_values(by="Confidence (>%s)" % min_conf, ascending=False).reset_index(drop=True)
    return result_df
min_sup = 0.01
min_conf = 0.1
resulting_df = run_apriori_algorithm(analysis_list, min_sup, min_conf)
resulting_df
Comments for Association Rule Learning
min_sup = 0.001
min_conf = 0.01
resulting_df = run_apriori_algorithm(analysis_list, min_sup, min_conf)
resulting_df.head(30)
resulting_df.shape
Comments for Association Rule Learning
Again, the suggestions seem to follow the most frequently purchased items, since the Y value is always one of the top purchased items. On the other hand, we now have the chance to examine the X items more closely. The sausage item appears more frequently now; therefore it can be considered for further sales strategies.
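One way to act on this observation is to filter out rules whose consequent is one of the top sellers, leaving only the less obvious suggestions. The rules table below is hypothetical (made-up supports and confidences, shaped like resulting_df above):

```python
import pandas as pd

# Hypothetical rules with the same column layout as resulting_df
# (column names assume min_sup=0.001, min_conf=0.01 as above).
rules = pd.DataFrame(
    [(["sausage"], ["whole milk"], 0.005, 0.18),
     (["curd"], ["yogurt"], 0.002, 0.12),
     (["frankfurter"], ["other vegetables"], 0.003, 0.11)],
    columns=["{X} ->", "{Y}", "Support (>0.001)", "Confidence (>0.01)"])

top_items = {"whole milk", "other vegetables", "rolls/buns"}

# Keep only rules whose consequent does not overlap the top sellers.
mask = rules["{Y}"].apply(lambda y: not set(y) & top_items)
niche_rules = rules[mask].reset_index(drop=True)
print(niche_rules)
```

Applied to the real resulting_df, the same mask would surface rules pointing at items other than whole milk, other vegetables, and rolls/buns.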
Graph & Network Analysis for Association Rule Learning
To further analyze and better visualize the association between the market basket items, I will create graphs with networkx package and optimize the node positions via forceatlas2 package.
G = nx.Graph()
item_nodes = list(df_count["itemDescription"].unique())
other_nodes = list(analysis_df["itemDescription"])
G.add_nodes_from(item_nodes)
for transaction in other_nodes:
    for comb_items in itertools.combinations(transaction, 2):
        if G.has_edge(comb_items[0], comb_items[1]):
            G[comb_items[0]][comb_items[1]]["weight"] += 1
        else:
            G.add_edge(comb_items[0], comb_items[1], weight=1)
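The loop above weights each edge by how many transactions contain that pair of items. The same counting logic can be sanity-checked without networkx, using a plain Counter on toy transactions of my own:

```python
from collections import Counter
import itertools

# Two toy baskets; milk and bread co-occur in both.
toy = [["milk", "bread", "butter"], ["milk", "bread"]]

weights = Counter()
for transaction in toy:
    for pair in itertools.combinations(transaction, 2):
        weights[frozenset(pair)] += 1  # frozenset: pair order does not matter

print(weights[frozenset({"milk", "bread"})])   # 2
print(weights[frozenset({"milk", "butter"})])  # 1
```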
# Plot network with force atlas 2
forceatlas2 = ForceAtlas2(# Behavior alternatives
outboundAttractionDistribution=True,
linLogMode=False,
adjustSizes=False,
edgeWeightInfluence=1.0,
# Performance
jitterTolerance=1.0,
barnesHutOptimize=True,
barnesHutTheta=0.5,
multiThreaded=False,
# Tuning
scalingRatio=4.0,
strongGravityMode=False,
gravity=1.0,
# Log
verbose=True)
positions = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=5000) # undir_G
sizes_gm = []
colors_gm = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]) for i in range(G.number_of_nodes())]
for i in G.nodes():
    sizes_gm.append(G.degree(i))
nx.draw_networkx_nodes(G, positions, node_size=sizes_gm, node_color=colors_gm, alpha=0.8)
nx.draw_networkx_edges(G, positions, alpha=0.25)
plt.rcParams['figure.figsize'] = [20,20]
plt.axis('off')
plt.title('GRAPH OF MARKET BASKET ITEMS', fontsize=20)
plt.draw();
Graph & Network Analysis for Association Rule Learning
Since the graph representation above is very crowded, I will create a new, less crowded graph. The total number of transactions is 14963. I will randomly select only 300 of them to obtain a clearer graph.
sampled_other_nodes = random.sample(list(analysis_df["itemDescription"]), k=300)  # sample without replacement
most_freq_items = ["whole milk", "other vegetables", "rolls/buns"]
print(len(most_freq_items))
print(len(sampled_other_nodes))
print("Printing samples from Sampled Transactions:\n", random.choices(sampled_other_nodes, k=3))
simple_G = nx.Graph()
sample_item_nodes = list(set(itertools.chain.from_iterable(sampled_other_nodes)))
simple_G.add_nodes_from(sample_item_nodes)
for transaction in sampled_other_nodes:
    for comb_items in itertools.combinations(transaction, 2):
        if simple_G.has_edge(comb_items[0], comb_items[1]):
            simple_G[comb_items[0]][comb_items[1]]["weight"] += 1
        else:
            simple_G.add_edge(comb_items[0], comb_items[1], weight=1)
forceatlas2 = ForceAtlas2(# Behavior alternatives
outboundAttractionDistribution=True,
linLogMode=False,
adjustSizes=False,
edgeWeightInfluence=1.0,
# Performance
jitterTolerance=1.0,
barnesHutOptimize=True,
barnesHutTheta=0.5,
multiThreaded=False,
# Tuning
scalingRatio=4.0,
strongGravityMode=False,
gravity=1.0,
# Log
verbose=True)
undir_G = simple_G.to_undirected()
positions = forceatlas2.forceatlas2_networkx_layout(undir_G, pos=None, iterations=5000) # undir_G
sizes = []
colors = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]) for i in range(simple_G.number_of_nodes())]
for i in simple_G.nodes():
    sizes.append(simple_G.degree(i))
sizes = [i * 5 for i in sizes]
nx.draw_networkx_nodes(G=simple_G, pos=positions, node_size=sizes, node_color=colors, alpha=0.8)
nx.draw_networkx_edges(G=simple_G, pos=positions, alpha=0.1)
nx.draw_networkx_labels(G=simple_G, pos=positions, font_size=10)
plt.rcParams['figure.figsize'] = [20,20]
plt.axis('off')
plt.title('NETWORK OF MARKET BASKET ITEMS', fontsize=20)
plt.draw();